Abstract



We tackle the fundamental problem of Bayesian active learning with noise, where
we need to adaptively select from a number of expensive tests in order to identify
an unknown hypothesis sampled from a known prior distribution. In the case of
noise-free observations, a greedy algorithm called generalized binary search (GBS)
is known to perform near-optimally. We show that if the observations are noisy,
perhaps surprisingly, GBS can perform very poorly. We develop EC², a novel,
greedy active learning algorithm and prove that it is competitive with the optimal
policy, thus obtaining the first competitiveness guarantees for Bayesian active
learning with noisy observations. Our bounds rely on a recently discovered diminishing
returns property called adaptive submodularity, generalizing the classical notion
of submodular set functions to adaptive policies. Our results hold even if the tests
have non-uniform cost and their noise is correlated. We also propose EffECXtive,
a particularly fast approximation of EC², and evaluate it on a Bayesian
experimental design problem involving human subjects, intended to tease apart
competing economic theories of how people make decisions under uncertainty.

1 Introduction 

How should we perform experiments to determine the most accurate scientific theory among competing
candidates, or choose among expensive medical procedures to accurately determine a patient's
condition, or select which labels to obtain in order to determine the hypothesis that minimizes
generalization error? In all these applications, we have to sequentially select among a set of noisy, expensive
observations (outcomes of experiments, medical tests, expert labels) in order to determine which
hypothesis (theory, diagnosis, classifier) is most accurate. This fundamental problem has been studied in
a number of areas, including statistics [16], decision theory [12], machine learning [18, 7] and others.
One way to formalize such active learning problems is Bayesian experimental design [6], where one
assumes a prior on the hypotheses, as well as probabilistic assumptions on the outcomes of tests. The
goal then is to determine the correct hypothesis while minimizing the cost of the experimentation.
Unfortunately, finding this optimal policy is not just NP-hard, but also hard to approximate [5]. Several
heuristic approaches have been proposed that perform well in some applications, but do not carry
theoretical guarantees (e.g., [17]). In the case where observations are noise-free¹, a simple algorithm,
generalized binary search² (GBS) run on a modified prior, is guaranteed to be competitive with the optimal
policy; the expected number of queries is a factor of $O(\log n)$ (where $n$ is the number of hypotheses)
more than that of the optimal policy [14], which matches lower bounds up to constant factors [5].

¹ This case is known as the Optimal Decision Tree (ODT) problem.

² GBS greedily selects tests to maximize, in expectation over the test outcomes, the prior probability mass of
eliminated hypotheses (i.e., those with zero posterior probability, computed w.r.t. the observed test outcomes).

The important case of noisy observations, however, as present in most applications, is much less
well understood. While there are some recent positive results in understanding the label complexity
of noisy active learning [18, 1], to our knowledge, so far there are no algorithms that are provably
competitive with the optimal sequential policy, except in very restricted settings [15]. In this paper, we
introduce a general formulation of Bayesian active learning with noisy observations that we call the 
Equivalence Class Determination problem. We show that, perhaps surprisingly, generalized binary 
search performs poorly in this setting, as do greedily (myopically) maximizing the information gain 
(measured w.r.t. the distribution on equivalence classes) or the decision-theoretic value of information. 
This motivates us to introduce a novel active learning criterion, and use it to develop a greedy active
learning algorithm called the Equivalence Class Edge Cutting algorithm (EC²), whose expected cost
is competitive with that of the optimal policy. Our key insight is that our new objective function satisfies
adaptive submodularity [9], a natural diminishing returns property that generalizes the classical notion
of submodularity to adaptive policies. Our results also allow us to relax the common assumption
that the outcomes of the tests are conditionally independent given the true hypothesis. We also
develop the Efficient Edge Cutting approximate objective algorithm (EffECXtive), an efficient
approximation to EC², and evaluate it on a Bayesian experimental design problem intended to tease
apart competing theories on how people make decisions under uncertainty, including Expected Value 
[21], Prospect Theory [13], Mean-Variance-Skewness [11] and Constant Relative Risk Aversion [19]. 
In our experiments, EffECXtive typically outperforms existing experimental design criteria such as 
information gain, uncertainty sampling, GBS, and decision-theoretic value of information. Our results 
from human subject experiments further reveal that EffECXtive can be used as a real-time tool 
to classify people according to the economic theory that best describes their behaviour in financial 
decision-making, and reveal some interesting heterogeneity in the population. 

2 Bayesian Active Learning in the Noiseless Case 

In the Bayesian active learning problem, we would like to distinguish among a given set of hypotheses
$\mathcal{H} = \{h_1, \ldots, h_n\}$ by performing tests from a set $\mathcal{T} = \{1, \ldots, N\}$ of possible tests. Running test
$t$ incurs a cost of $c(t)$ and produces an outcome from a finite set of outcomes $\mathcal{X} = \{1, 2, \ldots, \ell\}$.
We let $H$ denote the random variable which equals the true hypothesis, and model the outcome of
each test $t$ by a random variable $X_t$ taking values in $\mathcal{X}$. We denote the observed outcome of test
$t$ by $x_t$. We further suppose we have a prior distribution $P$ modeling our assumptions on the joint
probability $P(H, X_1, \ldots, X_N)$ over the hypotheses and test outcomes. In the noiseless case, we
assume that the outcome of each test is deterministic given the true hypothesis, i.e., for each $h \in \mathcal{H}$,
$P(X_1, \ldots, X_N \mid H = h)$ is a deterministic distribution. Thus, each hypothesis $h$ is associated
with a particular vector of test outcomes. We assume, w.l.o.g., that no two hypotheses lead to the
same outcomes for all tests. Thus, if we perform all tests, we can uniquely determine the true
hypothesis. However, in most applications we will wish to avoid performing every possible test, as
this is prohibitively expensive. Our goal is to find an adaptive policy for running tests that allows us
to determine the value of $H$ while minimizing the cost of the tests performed. Formally, a policy $\pi$
(also called a conditional plan) is a partial mapping from partial observation vectors $x_A$ to tests,
specifying which test to run next (or whether we should stop testing) for any observation vector $x_A$.
Hereby, $x_A \in \mathcal{X}^A$ is a vector of outcomes indexed by a set of tests $A \subseteq \mathcal{T}$ that we have performed
so far³ (e.g., the set of labeled examples in active learning, or outcomes of a set of medical tests that
we ran). After having made observations $x_A$, we can rule out inconsistent hypotheses. We denote
the set of hypotheses consistent with event $\Lambda$ (often called the version space associated with $\Lambda$) by
$V(\Lambda) := \{h \in \mathcal{H} : P(h \mid \Lambda) > 0\}$. We call a policy feasible if it is guaranteed to uniquely determine
the correct hypothesis. That is, upon termination with observation $x_A$, it must hold that $|V(x_A)| = 1$.
We can define the expected cost of a policy $\pi$ by

$$c(\pi) := \sum_{h} P(h)\, c(\mathcal{T}(\pi, h)),$$

where $\mathcal{T}(\pi, h) \subseteq \mathcal{T}$ is the set of tests run by policy $\pi$ in case $H = h$. Our goal is to find a feasible
policy $\pi^*$ of minimum expected cost, i.e.,

$$\pi^* = \arg\min_{\pi} \{c(\pi) : \pi \text{ is feasible}\}. \qquad (2.1)$$

A policy $\pi$ can be naturally represented as a decision tree $T^\pi$, and thus problem (2.1) is often called
the Optimal Decision Tree (ODT) problem.

³ Formally we also require that $(x_t)_{t \in B} \in \mathrm{dom}(\pi)$ and $A \subseteq B$ implies $(x_t)_{t \in A} \in \mathrm{dom}(\pi)$ (cf. [9]).

Unfortunately, obtaining an approximate policy $\pi$ for which $c(\pi) \le c(\pi^*) \cdot o(\log(n))$ is NP-hard [5].
Hence, various heuristics are employed to solve the Optimal Decision Tree problem and its variants.
Two of the most popular heuristics are to select tests greedily to maximize the information gain (IG)
conditioned on previous test outcomes, and generalized binary search (GBS). Both heuristics are
greedy, and after having made observations $x_A$ will select

$$t^* = \arg\max_{t \in \mathcal{T}} \Delta_{\mathrm{Alg}}(t \mid x_A) / c(t),$$

where $\mathrm{Alg} \in \{\mathrm{IG}, \mathrm{GBS}\}$. Here, $\Delta_{IG}(t \mid x_A) := \mathbb{H}(X_\mathcal{T} \mid x_A) - E_{x_t \sim X_t \mid x_A}[\mathbb{H}(X_\mathcal{T} \mid x_A, x_t)]$ is the
marginal information gain measured with respect to the Shannon entropy $\mathbb{H}(X) := E_x[-\log_2 P(x)]$,
and $\Delta_{GBS}(t \mid x_A) := P(V(x_A)) - \sum_{x \in \mathcal{X}} P(X_t = x \mid x_A)\, P(V(x_A, X_t = x))$ is the expected
reduction in version space probability mass. Thus, both heuristics greedily choose the test that
maximizes the benefit-cost ratio, measured with respect to their particular benefit functions. They
stop after running a set of tests $A$ such that $|V(x_A)| = 1$, i.e., once the true hypothesis has been
uniquely determined.
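
The GBS rule just described can be illustrated with a minimal Python sketch for the noiseless case. The data layout is an assumption made for illustration only: `prior` maps hypothesis names to probabilities, `outcomes[h][t]` is the deterministic outcome of test `t` under hypothesis `h`, and the function names are hypothetical.

```python
def version_space(prior, outcomes, observed):
    """Hypotheses with positive prior that agree with every observed outcome."""
    return [h for h, p in prior.items()
            if p > 0 and all(outcomes[h][t] == x for t, x in observed.items())]


def delta_gbs(t, prior, outcomes, observed):
    """Expected reduction in version-space prior mass from running test t."""
    vs = version_space(prior, outcomes, observed)
    mass = sum(prior[h] for h in vs)
    # Prior mass of the surviving hypotheses that predict each outcome of test t.
    by_outcome = {}
    for h in vs:
        x = outcomes[h][t]
        by_outcome[x] = by_outcome.get(x, 0.0) + prior[h]
    # P(X_t = x | x_A) = m_x / mass and P(V(x_A, X_t = x)) = m_x.
    return mass - sum((m / mass) * m for m in by_outcome.values())


def run_gbs(prior, outcomes, costs, true_h):
    """Greedy benefit-cost selection until the version space is a singleton."""
    observed = {}
    while len(version_space(prior, outcomes, observed)) > 1:
        remaining = [t for t in range(len(costs)) if t not in observed]
        best = max(remaining,
                   key=lambda t: delta_gbs(t, prior, outcomes, observed) / costs[t])
        observed[best] = outcomes[true_h][best]   # noiseless observation
    return list(observed)


# Hypothetical toy instance: four hypotheses, three unit-cost binary tests.
prior = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
outcomes = {"a": (0, 0, 1), "b": (0, 1, 0), "c": (1, 0, 0), "d": (1, 1, 0)}
tests_run = run_gbs(prior, outcomes, [1.0, 1.0, 1.0], true_h="d")
```

On this toy instance the sketch runs tests 0 and 1 and never needs the unbalanced test 2, which splits the version space least evenly.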

It turns out that for the (noiseless) Optimal Decision Tree problem, these two heuristics are equivalent
[22], as can be proved using the chain rule of entropy. Interestingly, despite its myopic nature
GBS has been shown [14, 7, 10, 9] to obtain near-optimal expected cost: the strongest known bound
is $c(\pi_{GBS}) \le c(\pi^*)\,(\ln(1/p_{\min}) + 1)$ where $p_{\min} := \min_{h \in \mathcal{H}} P(h)$. Let $x_S(h)$ be the unique
vector $x_S \in \mathcal{X}^S$ such that $P(x_S \mid h) = 1$. The result above is proved by exploiting the fact
that $f_{GBS}(S, h) := 1 - P(V(x_S(h))) + P(h)$ is adaptive submodular and strongly adaptively
monotone [9]. Call $x_A$ a subvector of $x_B$ if $A \subseteq B$ and $P(x_B \mid x_A) > 0$. In this case we write
$x_A \prec x_B$. A function $f : 2^{\mathcal{T}} \times \mathcal{H} \to \mathbb{R}$ is called adaptive submodular w.r.t. a distribution $P$, if for any
$x_A \prec x_B$ and any test $t$ it holds that $\Delta(t \mid x_A) \ge \Delta(t \mid x_B)$, where

$$\Delta(t \mid x_A) := E_H[f(A \cup \{t\}, H) - f(A, H) \mid x_A].$$

Thus, $f$ is adaptive submodular if the expected marginal benefits $\Delta(t \mid x_A)$ of adding a new test $t$ can
only decrease as we gather more observations. $f$ is called strongly adaptively monotone w.r.t. $P$ if,
informally, "observations never hurt" with respect to the expected reward. Formally, for all $A$, all
$t \notin A$, and all $x \in \mathcal{X}$ we require $E_H[f(A, H) \mid x_A] \le E_H[f(A \cup \{t\}, H) \mid x_A, X_t = x]$.

The performance guarantee for GBS follows from the following general result about the greedy
algorithm for adaptive submodular functions (applied with $Q = 1$ and $\eta = p_{\min}$):

Theorem 1 (Theorem 10 of [9] with $\alpha = 1$). Suppose $f : 2^{\mathcal{T}} \times \mathcal{H} \to \mathbb{R}_{\ge 0}$ is adaptive submodular
and strongly adaptively monotone with respect to $P$ and there exists $Q$ such that $f(\mathcal{T}, h) = Q$ for all
$h$. Let $\eta$ be any value such that $f(S, h) > Q - \eta$ implies $f(S, h) = Q$ for all sets $S$ and hypotheses $h$.
Then for self-certifying instances the adaptive greedy policy $\pi$ satisfies

$$c(\pi) \le c(\pi^*) \left( \ln\left(\frac{Q}{\eta}\right) + 1 \right).$$

The technical requirement that instances be self-certifying means that the policy will have proof
that it has obtained the maximum possible objective value, $Q$, immediately upon doing so. It is
not difficult to show that this is the case with the instances we consider in this paper. We refer the
interested reader to [9] for more detail.

In the following sections, we will use the concept of adaptive submodularity to provide the first 
approximation guarantees for Bayesian active learning with noisy observations. 

3 The Equivalence Class Determination Problem and the EC² Algorithm

We now wish to consider the Bayesian active learning problem where tests can have noisy outcomes.
Our general strategy is to reduce the problem of noisy observations to the noiseless setting. To gain
intuition, consider a simple model where tests have binary outcomes, and we know that the outcome
of exactly one test, chosen uniformly at random unbeknown to us, is flipped. If any pair of hypotheses
$h \ne h'$ differs by the outcome of at least three tests, we can still uniquely determine the correct
hypothesis after running all tests. In this case we can reduce the noisy active learning problem to
the noiseless setting by, for each hypothesis, creating $N$ "noisy" copies, each obtained by flipping
the outcome of one of the $N$ tests. The modified prior $P'$ would then assign mass $P'(h') = P(h)/N$
to each noisy copy $h'$ of $h$. The conditional distribution $P'(X_\mathcal{T} \mid h')$ is still deterministic (obtained
by flipping the outcome of one of the tests). Thus, each hypothesis $h_i$ in the original problem is
now associated with a set $\mathcal{H}_i$ of hypotheses in the modified problem instance. However, instead of
selecting tests to determine which noisy copy has been realized, we only care which set $\mathcal{H}_i$ is realized.



The Equivalence Class Determination problem (ECD). More generally, we introduce the
Equivalence Class Determination problem⁴, where our set of hypotheses $\mathcal{H}$ is partitioned into a set
of $m$ equivalence classes $\{\mathcal{H}_1, \ldots, \mathcal{H}_m\}$ so that $\mathcal{H} = \biguplus_{i=1}^{m} \mathcal{H}_i$, and the goal is to determine which
class $\mathcal{H}_i$ the true hypothesis lies in. Formally, upon termination with observations $x_A$ we require
that $V(x_A) \subseteq \mathcal{H}_i$ for some $i$. As with the ODT problem, the goal is to minimize the expected cost
of the tests, where the expectation is taken over the true hypothesis sampled from $P$. In §4, we will
show how the Equivalence Class Determination problem arises naturally from Bayesian experimental
design problems in probabilistic models.

Given the fact that GBS performs near-optimally on the Optimal Decision Tree problem, a natural
approach to solving ECD would be to run GBS until the termination condition is met. Unfortunately, and
perhaps surprisingly, GBS can perform very poorly on the ECD problem. Consider an instance with
a uniform prior over $n$ hypotheses, $h_1, \ldots, h_n$, and two equivalence classes $\mathcal{H}_1 := \{h_i : 1 \le i < n\}$
and $\mathcal{H}_2 := \{h_n\}$. There are tests $\mathcal{T} = \{1, \ldots, n\}$ such that $h_i(t) = \mathbf{1}[i = t]$, all of unit cost. Hereby,
$\mathbf{1}[\Lambda]$ is the indicator variable for event $\Lambda$. In this case, the optimal policy only needs to select test
$n$; however, GBS may select tests $1, 2, \ldots, n$ in order until running test $t$, where $H = h_t$ is the true
hypothesis. Given our uniform prior, it takes roughly $n/2$ tests in expectation until this happens, so that GBS
pays, in expectation, roughly $n/2$ times the optimal expected cost in this instance.
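
As a sanity check on this instance, the following Python sketch computes the exact expected number of tests GBS runs, using exact rational arithmetic. It assumes that ties in the GBS criterion are broken toward the lowest-numbered test, which is one admissible behaviour of GBS rather than something the argument above prescribes; the function name is hypothetical.

```python
from fractions import Fraction


def gbs_expected_tests(n: int) -> Fraction:
    """Exact expected number of unit-cost tests GBS runs on the instance above,
    assuming ties are broken toward the lowest-numbered test."""
    p = Fraction(1, n)                        # uniform prior mass per hypothesis

    def class_of(i: int) -> int:
        return 1 if i < n else 2              # H_1 = {h_1..h_{n-1}}, H_2 = {h_n}

    total = Fraction(0)
    for true_i in range(1, n + 1):            # condition on H = h_{true_i}
        alive = set(range(1, n + 1))          # indices of consistent hypotheses
        used = []
        # ECD termination: stop once the version space lies inside one class.
        while len({class_of(i) for i in alive}) > 1:
            mass = p * len(alive)

            def gain(t: int) -> Fraction:
                ones = p if t in alive else Fraction(0)   # mass predicting outcome 1
                zeros = mass - ones
                return mass - (ones * ones + zeros * zeros) / mass

            candidates = [t for t in range(1, n + 1) if t not in used]
            t = max(candidates, key=gain)     # max() keeps the first maximizer
            used.append(t)
            alive = {true_i} if t == true_i else alive - {t}
        total += p * len(used)
    return total
```

Under that tie-breaking assumption, every unused test has the same GBS score, so the tests are run in index order. For $n = 10$ the function returns $27/5$, whereas the optimal policy runs the single test $n$.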

The poor performance of GBS in this instance may be attributed to its lack of consideration for the
equivalence classes. Another natural heuristic would be to run the greedy information gain policy,
only with the entropy measured with respect to the probability distribution on equivalence classes
rather than hypotheses. Call this policy $\pi_{IG}$. It is clearly aware of the equivalence classes, as it
adaptively and myopically selects tests to reduce the uncertainty of the realized class, measured w.r.t.
the Shannon entropy. However, we can show there are instances in which it pays $\Omega(n/\log(n))$ times
the optimal cost, even under a uniform prior. Refer to Appendix B for details.

The EC² algorithm. GBS fails because reducing the version space mass does
not necessarily facilitate differentiation among the classes $\mathcal{H}_i$. The policy $\pi_{IG}$ fails because there
are complementarities among tests; a set of tests can be far better than the sum of its parts. Thus, we
would like to optimize an objective function that encourages differentiation among classes, but lacks
complementarities. We adopt a very elegant idea from Dasgupta [8], and define weighted edges
between hypotheses that we aim to distinguish between. However, instead of introducing edges between
arbitrary pairs of hypotheses (as done in [8]), we only introduce edges between hypotheses in different
classes. Tests will allow us to cut edges inconsistent with their outcomes, and we aim to eliminate
all inconsistent edges while minimizing the expected cost incurred. We now formalize this intuition.

Specifically, we define a set of edges $\mathcal{E} = \bigcup_{1 \le i < j \le m} \{\{h, h'\} : h \in \mathcal{H}_i, h' \in \mathcal{H}_j\}$, consisting of all
(unordered) pairs of hypotheses belonging to distinct classes. These are the edges that must be cut, by
which we mean for any edge $\{h, h'\} \in \mathcal{E}$, at least one hypothesis in $\{h, h'\}$ must be ruled out (i.e.,
eliminated from the version space). Hence, a test $t$ run under true hypothesis $h$ is said to cut edges
$\mathcal{E}_t(h) := \{\{h', h''\} : h'(t) \ne h(t) \text{ or } h''(t) \ne h(t)\}$. See Fig. 1(a) for an illustration. We define a
weight function $w : \mathcal{E} \to \mathbb{R}_{\ge 0}$ by $w(\{h, h'\}) := P(h) \cdot P(h')$. We extend the weight function to an
additive (modular) function on sets of edges in the natural manner, i.e., $w(\mathcal{E}') := \sum_{e \in \mathcal{E}'} w(e)$. The
objective $f_{EC}$ that we will greedily maximize is then defined as the weight of the edges cut (EC):

$$f_{EC}(A, h) := w\Big(\bigcup_{t \in A} \mathcal{E}_t(h)\Big). \qquad (3.1)$$

The key insight that allows us to prove approximation guarantees for $f_{EC}$ is that $f_{EC}$ shares the same
beneficial properties that make $f_{GBS}$ amenable to efficient greedy optimization. We prove this fact,
as stated in Proposition 2, in Appendix A.

Proposition 2. The objective $f_{EC}$ is strongly adaptively monotone and adaptively submodular.

Based on the objective $f_{EC}$, we can calculate the marginal benefits for test $t$ upon observations $x_A$ as

$$\Delta_{EC}(t \mid x_A) := E_H[f_{EC}(A \cup \{t\}, H) - f_{EC}(A, H) \mid x_A].$$

⁴ Bellala et al. simultaneously studied ECD [2], and, like us, used it to model active learning with noise [3].
They developed an extension of GBS for ECD. We defer a detailed comparison of our approaches to future work.
Figure 1: (a) An instance of Equivalence Class Determination with binary test outcomes, shown with the set of
edges that must be cut, and depicting the effects of test $i$ under different outcomes. (b) The graphical model
underlying our error model.



We call the adaptive policy $\pi_{EC}$ that, after observing $x_A$, greedily selects test
$t^* \in \arg\max_t \Delta_{EC}(t \mid x_A)/c(t)$, the EC² algorithm (for equivalence class edge cutting).

Note that these instances are self-certifying, because we obtain maximum objective value if and
only if the version space lies within an equivalence class, and the policy can certify this condition
when it holds. So we can apply Theorem 1 to show EC² obtains a $\ln(Q/\eta) + 1$ approximation to
Equivalence Class Determination. Hereby, $Q = w(\mathcal{E}) = 1 - \sum_i (P(h \in \mathcal{H}_i))^2 \le 1$ is the total
weight of all edges that need to be cut, and $\eta = \min_{e \in \mathcal{E}} w(e) \ge p_{\min}^2$ is a bound on the minimum
weight among all edges. We have the following result:

Theorem 3. Suppose $P(h)$ is rational for all $h \in \mathcal{H}$. For the adaptive greedy policy $\pi_{EC}$
implemented by EC² it holds that

$$c(\pi_{EC}) \le (2 \ln(1/p_{\min}) + 1)\, c(\pi^*),$$

where $p_{\min} := \min_{h \in \mathcal{H}} P(h)$ is the minimum prior probability of any hypothesis, and $\pi^*$ is the
optimal policy for the Equivalence Class Determination problem.

In the case of unit cost tests, we can apply a technique of Kosaraju et al. [14], originally developed
for the GBS algorithm, to improve the approximation guarantee to $O(\log n)$ by applying EC² with a
modified prior distribution. We defer details to the full version of this paper.
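
To make the edge-cutting rule concrete, here is a minimal, deliberately naive Python sketch of the greedy policy for noiseless outcomes, applied to the instance on which GBS performs poorly. The data layout (`prior`, `outcomes[h][t]`, `cls[h]`) and the function names are assumptions made for illustration, and hypotheses and tests are indexed from 0, so the special test $n$ has index `n - 1`.

```python
from itertools import combinations


def uncut_weight(hyps, prior, cls):
    """Total weight of edges whose two endpoints are both still consistent."""
    return sum(prior[a] * prior[b]
               for a, b in combinations(hyps, 2) if cls[a] != cls[b])


def ec2_gain(t, alive, prior, outcomes, cls):
    """Naive Delta_EC(t | x_A): expected weight of uncut edges that test t cuts."""
    mass = sum(prior[h] for h in alive)
    before = uncut_weight(alive, prior, cls)
    gain = 0.0
    for x in {outcomes[h][t] for h in alive}:
        survivors = [h for h in alive if outcomes[h][t] == x]
        p_x = sum(prior[h] for h in survivors) / mass     # P(X_t = x | x_A)
        gain += p_x * (before - uncut_weight(survivors, prior, cls))
    return gain


def run_ec2(prior, outcomes, cls, costs, true_h):
    """Greedy edge cutting until the version space lies within a single class."""
    alive = [h for h in prior if prior[h] > 0]
    chosen = []
    while len({cls[h] for h in alive}) > 1:
        rest = [t for t in range(len(costs)) if t not in chosen]
        t = max(rest, key=lambda t: ec2_gain(t, alive, prior, outcomes, cls) / costs[t])
        chosen.append(t)
        alive = [h for h in alive if outcomes[h][t] == outcomes[true_h][t]]
    return chosen


# The instance from the GBS discussion, 0-indexed: h_i(t) = 1[i = t].
n = 6
prior = {h: 1.0 / n for h in range(n)}
outcomes = {h: tuple(int(h == t) for t in range(n)) for h in range(n)}
cls = {h: 0 if h < n - 1 else 1 for h in range(n)}
costs = [1.0] * n
tests_run = run_ec2(prior, outcomes, cls, costs, true_h=2)
```

On this instance the last test has expected gain $(n-1)/n^2$ while every other test has expected gain $2(n-1)/n^3$, so for $n > 2$ the sketch selects the single test that separates the two classes and then stops.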



A Fast Implementation of EC². The running time of EC² is dominated by the evaluations of
$\Delta_{EC}(t \mid x_A)$. The naive way to compute $\Delta_{EC}(t \mid x_A)$ is to construct a graph on the $n$ hypotheses with
weighted edges as prescribed by EC², and then see which edges are cut by $t$ for each potential test
outcome. Assuming there are $\ell$ possible outcomes of a test, and that we can evaluate $h(t)$ in unit time
for all $h$ and $t$, this will take $O(n^2 \ell)$ time. With $N$ tests, the total time per round of EC² is $O(N n^2 \ell)$.
However, there is a much faster way to compute $\Delta_{EC}(t \mid x_A)$. Note that $\Delta_{EC}(t \mid x_A)$ equals

$$\frac{1}{2}\, E_{x_t \sim X_t \mid x_A}\Big[\sum_{i \ne j} \big( P(\mathcal{H}_i \cap V(x_A))\, P(\mathcal{H}_j \cap V(x_A)) - P(\mathcal{H}_i \cap V(x_A, x_t))\, P(\mathcal{H}_j \cap V(x_A, x_t)) \big)\Big].$$

Now, compute $\alpha(i, x_t) := P(\mathcal{H}_i \cap V(x_A, x_t))$ for each $i$ and $x_t$, then compute $\beta(i) := P(\mathcal{H}_i \cap V(x_A)) =
\sum_x \alpha(i, x)$. Next, compute $\gamma(x_t) := P(x_t \mid x_A) = \sum_i \alpha(i, x_t) / \sum_i \beta(i)$. All of these terms can
be computed in total time $O(n)$ by iterating over the hypotheses and for each $h$ adding $P(h)$ to the
appropriate terms (i.e., $\beta(i)$, $\alpha(i, x)$, and $\gamma(x)$ if $h \in \mathcal{H}_i$ and $h(t) = x$). Using these variables, we
can rewrite $\Delta_{EC}(t \mid x_A)$ as

$$E_{x_t \sim X_t \mid x_A}\Big[\frac{1}{2} \sum_{i \ne j} \big( \beta(i)\beta(j) - \alpha(i, x_t)\alpha(j, x_t) \big)\Big].$$

Note that for any $v_1, v_2, \ldots, v_m \in \mathbb{R}$, we have $\sum_{i \ne j} v_i v_j = (\sum_i v_i)^2 - \sum_i v_i^2$. Using this equality, we can evaluate
sums like $\sum_{i \ne j} (\beta(i)\beta(j) - \alpha(i, x_t)\alpha(j, x_t))$ in $O(m)$ time, where there are $m$ equivalence classes.
Hence the total time to evaluate $\Delta_{EC}(t \mid x_A)$ is $O(n + m\ell)$ using this method. In a similar manner,
we can reduce the running time still further to $O(n)$, by incrementally computing terms such as
$(\sum_i \alpha(i, x))^2$ and $\sum_i \alpha(i, x)^2$ as we iterate through the hypotheses. The total time per round of EC²
is then $O(Nn)$. Additionally, the number of evaluations the algorithm needs to make can often be
significantly reduced in practice using the accelerated adaptive greedy algorithm, as discussed in [9].
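
The bookkeeping with $\alpha$, $\beta$ and $\gamma$ can be sketched in Python as follows, together with a brute-force reference used only to check agreement. The data layout, the toy instance and the function names are hypothetical; `alive` stands for the version space $V(x_A)$, and the sketch runs in time roughly $O(n + m\ell)$ rather than implementing the fully incremental $O(n)$ variant.

```python
from collections import defaultdict
from itertools import combinations


def ec2_gain_fast(t, alive, prior, outcomes, cls):
    """Delta_EC(t | x_A) via alpha/beta/gamma, one pass over `alive` = V(x_A)."""
    alpha = defaultdict(float)   # alpha(i, x): mass of class i predicting outcome x
    beta = defaultdict(float)    # beta(i): mass of class i inside V(x_A)
    by_x = defaultdict(float)    # sum_i alpha(i, x)
    for h in alive:
        x = outcomes[h][t]
        alpha[(cls[h], x)] += prior[h]
        beta[cls[h]] += prior[h]
        by_x[x] += prior[h]
    mass = sum(beta.values())
    # Identity: sum_{i != j} v_i v_j = (sum_i v_i)^2 - sum_i v_i^2.
    beta_pairs = mass ** 2 - sum(b * b for b in beta.values())
    alpha_sq = defaultdict(float)            # sum_i alpha(i, x)^2
    for (_, x), a in alpha.items():
        alpha_sq[x] += a * a
    gain = 0.0
    for x, s in by_x.items():
        gamma = s / mass                     # gamma(x) = P(X_t = x | x_A)
        gain += gamma * 0.5 * (beta_pairs - (s * s - alpha_sq[x]))
    return gain


def ec2_gain_naive(t, alive, prior, outcomes, cls):
    """Reference O(n^2) computation over explicit edges, for comparison only."""
    def uncut(hyps):
        return sum(prior[a] * prior[b]
                   for a, b in combinations(hyps, 2) if cls[a] != cls[b])
    mass = sum(prior[h] for h in alive)
    gain = 0.0
    for x in {outcomes[h][t] for h in alive}:
        keep = [h for h in alive if outcomes[h][t] == x]
        gain += sum(prior[h] for h in keep) / mass * (uncut(alive) - uncut(keep))
    return gain


# Hypothetical toy instance: 6 hypotheses, 3 classes, 3 binary tests.
prior = dict(zip("abcdef", [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]))
cls = dict(zip("abcdef", [0, 0, 1, 1, 2, 2]))
outcomes = {"a": (0, 0, 1), "b": (0, 1, 0), "c": (1, 0, 0),
            "d": (1, 1, 1), "e": (0, 1, 1), "f": (1, 0, 1)}
alive = list(prior)
gains = [(ec2_gain_fast(t, alive, prior, outcomes, cls),
          ec2_gain_naive(t, alive, prior, outcomes, cls)) for t in range(3)]
```

Both functions should agree up to floating-point error on any instance, since the fast version only regroups the sum over edges by class and outcome.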
4 Bayesian Active Learning with Noise and the EffECXtive Algorithm



We now address the case of noisy observations, using ideas from §3. With noisy observations, the
conditional distribution $P(X_1, \ldots, X_N \mid h)$ is no longer deterministic. We model the noise using an
additional random variable $\Theta$. Fig. 1(b) depicts the underlying graphical model. The vector of test
outcomes $x_\mathcal{T}$ is assumed to be an arbitrary, deterministic function $x_\mathcal{T} : \mathcal{H} \times \mathrm{supp}(\Theta) \to \mathcal{X}^N$; hence
$X_\mathcal{T} \mid h$ is distributed as $x_\mathcal{T}(h, \theta_h)$ where $\theta_h$ is distributed as $P(\theta \mid h)$. For example, there might be
up to $s = |\mathrm{supp}(\Theta)|$ ways any particular disease could manifest itself, with different patients with
the same disease suffering from different symptoms.

In cases where it is always possible to identify the true hypothesis, i.e., $x_\mathcal{T}(h, \theta) \ne x_\mathcal{T}(h', \theta')$
for all $h \ne h'$ and all $\theta, \theta' \in \mathrm{supp}(\Theta)$, we can reduce the problem to Equivalence Class
Determination with hypotheses $\{x_\mathcal{T}(h, \theta) : h \in \mathcal{H}, \theta \in \mathrm{supp}(\Theta)\}$ and equivalence classes $\mathcal{H}_i :=
\{x_\mathcal{T}(h_i, \theta) : \theta \in \mathrm{supp}(\Theta)\}$ for all $i$. Then Theorem 3 immediately yields that the approximation
factor of EC² is at most $2 \ln(1/\min_{h, \theta} P(h, \theta)) + 1$, where the minimum is taken over all $(h, \theta)$ in
the support of $P$. In the unit cost case, running EC² with a modified prior à la Kosaraju et al. [14]
allows us to obtain an $O(\log|\mathcal{H}| + \log|\mathrm{supp}(\Theta)|)$ approximation factor. Note this model allows us
to incorporate noise with complex correlations.

However, a major challenge when dealing with noisy observations is that it is not always possible
to distinguish distinct hypotheses. Even after we have run all tests, there will generally still be
uncertainty about the true hypothesis, i.e., the posterior distribution $P(H \mid x_\mathcal{T})$ obtained using
Bayes' rule may still assign non-zero probability to more than one hypothesis. If so, uniquely
determining the true hypothesis is not possible. Instead, we imagine that there is a set $\mathcal{D}$ of possible
decisions we may make after (adaptively) selecting a set of tests to perform and we must choose
one (e.g., we must decide how to treat the medical patient, which scientific theory to adopt, or which
classifier to use, given our observations). Thus our goal is to gather data to make effective decisions
[12]. Formally, for any decision $d \in \mathcal{D}$ we take, and each realized hypothesis $h$, we incur some loss
$\ell(d, h)$. Decision theory recommends, after observing $x_A$, to choose the decision $d^*$ that minimizes
the risk, i.e., the expected loss, namely $d^* \in \arg\min_d E_H[\ell(d, H) \mid x_A]$.
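
The minimum-risk rule can be written in a few lines of Python. This is only an illustrative sketch: the posterior is assumed to be given as a dictionary over hypotheses, and the decision names and loss values in the example are invented.

```python
def bayes_decision(posterior, loss, decisions):
    """Return (d*, risk) where d* minimizes the expected loss under `posterior`."""
    def risk(d):
        return sum(p * loss(d, h) for h, p in posterior.items())
    best = min(decisions, key=risk)
    return best, risk(best)


# Hypothetical example: two diagnoses, two treatments, asymmetric losses.
posterior = {"flu": 0.7, "cold": 0.3}
loss_table = {("treat_flu", "flu"): 0.0, ("treat_flu", "cold"): 1.0,
              ("treat_cold", "flu"): 5.0, ("treat_cold", "cold"): 0.0}
decision, risk = bayes_decision(posterior, lambda d, h: loss_table[(d, h)],
                                ["treat_flu", "treat_cold"])
```

Here the risk of `treat_flu` is 0.3 and the risk of `treat_cold` is 3.5, so the rule selects `treat_flu`.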

A natural goal in Bayesian active learning is thus to adaptively pick observations, until we are
guaranteed to make the same decision (and thus incur the same expected loss) that we would have
made had we run all tests. Thus, we can reduce the noisy Bayesian active learning problem to
the ECD problem by defining the equivalence classes over all test outcomes that lead to the same
minimum risk decision. Hence, for each decision $d \in \mathcal{D}$, we define

$$\mathcal{H}_d := \{x_\mathcal{T} : d = \arg\min_{d'} E_H[\ell(d', H) \mid x_\mathcal{T}]\}. \qquad (4.1)$$

If multiple decisions minimize the risk for a particular $x_\mathcal{T}$, we break ties arbitrarily. Identifying the
best decision $d \in \mathcal{D}$ then amounts to identifying which equivalence class $\mathcal{H}_d$ contains the realized
vector of outcomes, which is an instance of ECD.

One common approach to this problem is to myopically pick tests maximizing the
decision-theoretic value of information (VoI): $\Delta_{VoI}(t \mid x_A) := \min_d E_H[\ell(d, H) \mid x_A] -
E_{x_t \sim X_t \mid x_A}[\min_d E_H[\ell(d, H) \mid x_A, x_t]]$. The VoI of a test $t$ is the expected reduction in the expected
loss of the best decision due to the observation of $x_t$. However, we can show there are instances
in which such a policy pays $\Omega(n/\log(n))$ times the optimal cost, even under a uniform prior on
$(h, \theta)$ and with $|\mathrm{supp}(\Theta)| = 2$. Refer to Appendix B for details. In contrast, on such instances
EC² obtains an $O(\log n)$ approximation. More generally, we have the following result for EC² as
an immediate consequence of Theorem 3.

Theorem 4. Fix hypotheses $\mathcal{H}$, tests $\mathcal{T}$ with costs $c(t)$ and outcomes in $\mathcal{X}$, decision set $\mathcal{D}$, and
loss function $\ell$. Fix a prior $P(H, \Theta)$ and a function $x_\mathcal{T} : \mathcal{H} \times \mathrm{supp}(\Theta) \to \mathcal{X}^N$ which define the
probabilistic noise model. Let $c(\pi)$ denote the expected cost $\pi$ incurs to find the best decision, i.e.,
to identify which equivalence class $\mathcal{H}_d$ the outcome vector $x_\mathcal{T}$ belongs to. Let $\pi^*$ denote the policy
minimizing $c(\cdot)$, and let $\pi_{EC}$ denote the adaptive policy implemented by EC². Then it holds that

$$c(\pi_{EC}) \le (2 \ln(1/p'_{\min}) + 1)\, c(\pi^*),$$

where $p'_{\min} := \min_{h, \theta} \{P(h, \theta) : P(h, \theta) > 0\}$.
If all tests have unit cost, by using a modified prior [14] the approximation factor can be improved to
$O(\log|\mathcal{H}| + \log|\mathrm{supp}(\Theta)|)$ as in the case of Theorem 3.

The EffECXtive algorithm. For some noise models, $\Theta$ may have exponentially-large support. In
this case reducing Bayesian active learning with noise to Equivalence Class Determination results in
instances with exponentially-large equivalence classes. This makes running EC² on them challenging,
since explicitly keeping track of the equivalence classes is impractical. To overcome this challenge,
we develop EffECXtive, a particularly efficient algorithm which approximates EC².

For clarity, we only consider the 0-1 loss, i.e., our goal is to find the most likely hypothesis (MAP
estimate) given all the data $x_\mathcal{T}$, namely $h^*(x_\mathcal{T}) := \arg\max_h P(h \mid x_\mathcal{T})$. Recall definition (4.1), and
consider the weight of edges between distinct equivalence classes $\mathcal{H}_i$ and $\mathcal{H}_j$:

$$w(\mathcal{H}_i \times \mathcal{H}_j) = \sum_{x_\mathcal{T} \in \mathcal{H}_i,\; x'_\mathcal{T} \in \mathcal{H}_j} P(x_\mathcal{T})\, P(x'_\mathcal{T}) = \Big(\sum_{x_\mathcal{T} \in \mathcal{H}_i} P(x_\mathcal{T})\Big) \Big(\sum_{x'_\mathcal{T} \in \mathcal{H}_j} P(x'_\mathcal{T})\Big) = P(X_\mathcal{T} \in \mathcal{H}_i)\, P(X_\mathcal{T} \in \mathcal{H}_j).$$

In general, $P(X_\mathcal{T} \in \mathcal{H}_i)$ can be estimated to arbitrary accuracy using a rejection sampling approach
with bounded sample complexity. We defer details to a longer version of the paper. Here, we focus
on the case where, upon observing all tests, the hypothesis is uniquely determined, i.e., $P(H \mid x_\mathcal{T})$ is
deterministic for all $x_\mathcal{T}$ in the support of $P$. In this case, it holds that $P(X_\mathcal{T} \in \mathcal{H}_i) = P(H = h_i)$.
Thus, the total weight is

$$\sum_{i \ne j} w(\mathcal{H}_i \times \mathcal{H}_j) = \Big(\sum_i P(h_i)\Big)^2 - \sum_i P(h_i)^2 = 1 - \sum_i P(h_i)^2.$$

This insight motivates us to use the objective function

$$\Delta_{Eff}(t \mid x_A) := \Big[\sum_x P(X_t = x \mid x_A) \Big(\sum_i P(h_i \mid x_A, X_t = x)^2\Big)\Big] - \sum_i P(h_i \mid x_A)^2,$$

which is the expected reduction in weight from the prior to the posterior distribution. Note that
the weight of a distribution, $1 - \sum_i P(h_i)^2$, is a monotonically increasing function of the Rényi
entropy (of order 2), which is $-\log \sum_i P(h_i)^2$. Thus the objective $\Delta_{Eff}$ can be interpreted as a
(non-standard) information gain in terms of the (exponentiated) Rényi entropy. In our experiments,
we show that this criterion performs well in comparison to existing experimental design criteria,
including the classical Shannon information gain. Computing $\Delta_{Eff}(t \mid x_A)$ requires us to perform one
inference task for each outcome $x$ of $X_t$, and $O(n)$ computations to calculate the weight for each
outcome. We call the algorithm that greedily optimizes $\Delta_{Eff}$ the EffECXtive algorithm (since it
uses an Efficient Edge Cutting approximate objective), and present pseudocode in Algorithm 1.



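Input: Set of hypotheses $\mathcal{H}$; set of tests $\mathcal{T}$; prior distribution $P$; function $f$.
begin
    $A \leftarrow \emptyset$;
    while $\exists\, h \ne h' : P(h \mid x_A) > 0$ and $P(h' \mid x_A) > 0$ do
        foreach $t \in \mathcal{T} \setminus A$ do
            $\Delta_{Eff}(t \mid x_A) := \big[\sum_x P(X_t = x \mid x_A) \big(\sum_i P(h_i \mid x_A, X_t = x)^2\big)\big] - \sum_i P(h_i \mid x_A)^2$;
        Select $t^* \in \arg\max_t \Delta_{Eff}(t \mid x_A)/c(t)$; set $A \leftarrow A \cup \{t^*\}$ and observe outcome $x_{t^*}$;
end

Algorithm 1: The EffECXtive algorithm using the Efficient Edge Cutting approximate objective.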
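
A minimal Python sketch of Algorithm 1 follows. It relies on assumptions that go beyond the text: test outcomes are taken to be conditionally independent given the hypothesis, with a likelihood table `lik[h][t][x]` standing in for the inference step, and a test `budget` is added as a stopping rule because, under noise, the posterior may never concentrate on a single hypothesis. All names are hypothetical.

```python
def posterior_update(post, lik, t, x):
    """Bayes update of the hypothesis posterior after observing X_t = x."""
    unnorm = {h: p * lik[h][t][x] for h, p in post.items()}
    z = sum(unnorm.values())
    return {h: u / z for h, u in unnorm.items()}


def delta_eff(post, lik, t, outcomes):
    """Expected increase of sum_i P(h_i | .)^2 from running test t."""
    gain = -sum(p * p for p in post.values())
    for x in outcomes:
        p_x = sum(p * lik[h][t][x] for h, p in post.items())   # P(X_t = x | x_A)
        if p_x > 0:
            new_post = posterior_update(post, lik, t, x)
            gain += p_x * sum(q * q for q in new_post.values())
    return gain


def effecxtive(prior, lik, costs, outcomes, observe, budget):
    """Greedy loop of Algorithm 1, with an added budget on the number of tests."""
    post, used = dict(prior), []
    while sum(p > 0 for p in post.values()) > 1 and len(used) < budget:
        rest = [t for t in range(len(costs)) if t not in used]
        if not rest:
            break
        t = max(rest, key=lambda t: delta_eff(post, lik, t, outcomes) / costs[t])
        used.append(t)
        post = posterior_update(post, lik, t, observe(t))
    return used, post


# Hypothetical example: test 0 is uninformative, test 1 is 90% accurate.
prior = {"h0": 0.5, "h1": 0.5}
lik = {"h0": [[0.5, 0.5], [0.9, 0.1]],
       "h1": [[0.5, 0.5], [0.1, 0.9]]}
used, post = effecxtive(prior, lik, [1.0, 1.0], [0, 1],
                        observe=lambda t: 0, budget=1)
```

In this example the uninformative test has zero gain, so the informative test is selected first, and observing outcome 0 moves the posterior of `h0` to 0.9.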



5 Experiments 

Figure 2: (a) Accuracy of identifying the true model with fixed parameters. (b) Accuracy using a grid of
parameters, incorporating uncertainty in their values. (c) Experimental results: 11 subjects were classified into
the theories that described their behavior best. We plot probability of classified type.

Several economic theories make claims to explain how people make decisions when the payoffs
are uncertain. Here we use human subject experiments to compare four key theories proposed in
the literature. The uncertainty of the payoff in a given situation is represented by a lottery $L$, which is
simply a random variable with a range of payoffs $\mathcal{L} := \{\ell_1, \ldots, \ell_k\}$. For our purposes, a payoff is
an integer denoting how many dollars you receive (or lose, if the payoff is negative). Fix lottery
$L$, and let $p_i := P[L = \ell_i]$. The four theories posit distinct utility functions, with agents preferring
larger utility lotteries. Three of the theories have associated parameters. The Expected Value
theory [21] posits simply $U_{EV}(L) = E[L]$, and has no parameters. Prospect theory [13] posits
$U_{PT}(L) = \sum_i f(\ell_i)\, w(p_i)$ for nonlinear functions $f(\ell_i) = \ell_i^{\rho}$, if $\ell_i > 0$ and $f(\ell_i) = -\lambda(-\ell_i)^{\rho}$, if
$\ell_i < 0$, and $w(p_i) = e^{-(\log(1/p_i))^{\alpha}}$ [20]. The parameters $\Theta_{PT} = \{\rho, \lambda, \alpha\}$ represent risk aversion,
loss aversion and probability weighting factor respectively. For portfolio optimization problems, financial
economists have used value functions that give weights to different moments of the lottery [11]:
$U_{MVS}(L) = w_\mu \mu - w_\sigma \sigma + w_\nu \nu$, where $\Theta_{MVS} = \{w_\mu, w_\sigma, w_\nu\}$ are the weights for the mean, standard
deviation and standardized skewness of the lottery respectively. In Constant Relative Risk Aversion
theory [19], there is a parameter $\Theta_{CRRA} = a$ representing the level of risk aversion, and the utility
posited is $U_{CRRA}(L) = \sum_i p_i\, \ell_i^{1-a}/(1-a)$ if $a \ne 1$, and $U_{CRRA}(L) = \sum_i p_i \log(\ell_i)$, if $a = 1$.
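
For concreteness, the four utility functions can be sketched in Python as below. Several details are assumptions rather than part of the definitions above: $f(0)$ is set to 0 and $w(0)$ to 0, the skewness is set to 0 for a degenerate lottery, and the CRRA utility is only evaluated for strictly positive payoffs, since the stated formula is not defined otherwise. Function names are hypothetical.

```python
import math


def u_ev(payoffs, probs):
    """Expected value of the lottery."""
    return sum(p * l for l, p in zip(payoffs, probs))


def u_pt(payoffs, probs, rho, lam, alpha):
    """Prospect theory utility with power value function and Prelec weighting."""
    def f(l):
        if l > 0:
            return l ** rho
        if l < 0:
            return -lam * (-l) ** rho
        return 0.0                       # assumption: f(0) = 0

    def w(p):
        return math.exp(-(math.log(1.0 / p)) ** alpha) if p > 0 else 0.0

    return sum(f(l) * w(p) for l, p in zip(payoffs, probs))


def u_mvs(payoffs, probs, w_mu, w_sigma, w_nu):
    """Weighted mean, standard deviation and standardized skewness."""
    mu = u_ev(payoffs, probs)
    sigma = math.sqrt(sum(p * (l - mu) ** 2 for l, p in zip(payoffs, probs)))
    third = sum(p * (l - mu) ** 3 for l, p in zip(payoffs, probs))
    skew = third / sigma ** 3 if sigma > 0 else 0.0   # assumption for sigma = 0
    return w_mu * mu - w_sigma * sigma + w_nu * skew


def u_crra(payoffs, probs, a):
    """CRRA utility; only meaningful here for strictly positive payoffs."""
    if a == 1:
        return sum(p * math.log(l) for l, p in zip(payoffs, probs))
    return sum(p * l ** (1 - a) / (1 - a) for l, p in zip(payoffs, probs))
```

With $\rho = \lambda = \alpha = 1$ the prospect theory utility reduces to the expected value, which gives a quick consistency check of the implementation.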

The goal is to adaptively select a sequence of tests to present to a human subject in order to
distinguish which of the four theories best explains the subject's responses. Here a test $t$ is a pair
of lotteries, $(L_1^t, L_2^t)$. Based on the theory that represents behaviour, one of the lotteries would be
preferred to the other, denoted by a binary response $x_t \in \{1, 2\}$. The possible payoffs were fixed to
$\mathcal{L} = \{-10, 0, 10\}$ (in dollars), and the distribution $(p_1, p_2, p_3)$ over the payoffs was varied, where
$p_i \in \{0.01, 0.99\} \cup \{0.1, 0.2, \ldots, 0.9\}$. By considering all non-identical pairs of such lotteries, we
obtained the set of possible tests.

We compare five different methods: EffECXtive, greedily maximizing Information Gain (IG),
Uncertainty Sampling$^5$ (US), minimizing Version Space (VS), and tests selected at Random. We
first evaluated the ability of the algorithms to recover the true model based on simulated responses.
We chose parameter values for the theories such that they made distinct predictions and were
consistent with the values proposed in the literature [13]. We drew 1000 samples of the true model
and fixed the parameters of the model to some canonical values, $\theta_{PT} = \{0.9, 2.2, 0.9\}$,
$\theta_{MVS} = \{0.8, 0.25, 0.25\}$, $\theta_{CRRA} = 1$. Responses were generated using a softmax function, with the
probability of response $x_t = 1$ given by $P(x_t = 1) = 1/(1 + e^{U(L_2) - U(L_1)})$. Fig. 2(a) shows the
performance of the 5 methods, in terms of the accuracy of recovering the true model as a function of the
number of tests. We find that US and VS both perform significantly worse than Random in the presence
of noise. EffECXtive outperforms InfoGain significantly, which in turn outperforms Random.
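The softmax response model and the resulting Bayesian update over theories can be sketched as follows.
This is a minimal illustration: the function names, and the representation of each theory by the pair of
utilities it assigns to the two lotteries of a test, are assumptions made for the example.

```python
import math


def response_prob_first(u1, u2):
    """P(x_t = 1) = 1 / (1 + exp(U(L_2) - U(L_1)))."""
    return 1.0 / (1.0 + math.exp(u2 - u1))


def update_posterior(prior, utilities, response):
    """Bayes update over theories after one observed response.

    prior:     dict theory -> prior probability
    utilities: dict theory -> (U(L_1), U(L_2)) under that theory
    response:  1 or 2, the lottery the subject chose
    """
    unnormalized = {}
    for theory, p in prior.items():
        u1, u2 = utilities[theory]
        p1 = response_prob_first(u1, u2)
        likelihood = p1 if response == 1 else 1.0 - p1
        unnormalized[theory] = p * likelihood
    z = sum(unnormalized.values())
    return {theory: v / z for theory, v in unnormalized.items()}
```

For example, if one theory assigns the first lottery a higher utility and another assigns it a lower one,
observing the response $x_t = 1$ shifts posterior mass toward the former without ruling out the latter,
which is the sense in which the observations are noisy.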

We also considered uncertainty in the values of the parameters, by setting $\rho$ from 0.85-0.95, $\lambda$ from
2.1-2.3, $\alpha$ from 0.9-1; $w_\mu$ from 0.8-1.0, $w_\sigma$ from 0.2-0.3, $w_\nu$ from 0.2-0.3; and $a$ from 0.9-1.0, all
with 3 values per parameter. We generated 500 random samples by first randomly sampling a model
and then randomly sampling parameter values. EffECXtive and InfoGain outperformed Random
significantly, Fig. 2(b), although InfoGain did marginally better of the two. The increased
parameter range potentially poses model identifiability issues, and violates some of the assumptions
behind EffECXtive, decreasing its performance to the level of InfoGain.

After obtaining informed consent, we tested 11 human subjects to determine which model fit their
behaviour best. Laboratory experiments have been used previously to distinguish economic theories [4],
and here we used a real-time, dynamically optimized experiment that required fewer tests.
Subjects were presented 30 tests using EffECXtive. To incentivise the subjects, one of these tests
was picked at random, and subjects received payment based on the outcome of their chosen lottery. The
behavior of most subjects (7 out of 11) was best described by EV. This is not unexpected given the
unusually high quantitative abilities of the subjects. We also found heterogeneity in classification:
one subject was classified as MVS, as identified by her violations of stochastic dominance in the last
few tests. Two subjects were best described by prospect theory since they exhibited a high degree of loss
aversion and risk aversion. One subject was also classified as a CRRA-type (log-utility maximizer).
Figure 2(c) shows the probability of the classified model as a function of the number of tests. Although
we need a bigger and more diverse sample to make significant claims about the validity of different
economic theories, our preliminary results indicate that subject types can be identified and that there is
heterogeneity in the population. It also serves as a practical application of real-time dynamic
experimental design optimization that is necessary to collect data on human economic behavior.

$^5$ Uncertainty sampling greedily selects the test whose outcome distribution has maximum Shannon entropy.
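The uncertainty sampling baseline, which selects the test whose outcome distribution has maximum Shannon
entropy, can be sketched for binary responses as follows. The data layout (a dictionary mapping each
candidate test to the per-theory probability of the response $x_t = 1$) is a hypothetical choice made for
the example.

```python
import math


def outcome_entropy(posterior, p_first):
    """Shannon entropy (in bits) of the predictive distribution of a binary test.

    posterior: dict theory -> current probability of the theory
    p_first:   dict theory -> P(x_t = 1 | theory)
    """
    q = sum(posterior[theory] * p_first[theory] for theory in posterior)
    return -sum(p * math.log2(p) for p in (q, 1.0 - q) if p > 0)


def uncertainty_sampling(posterior, tests):
    """Return the id of the test whose predicted outcome is most uncertain.

    tests: dict test_id -> p_first dictionary as in outcome_entropy.
    """
    return max(tests, key=lambda t: outcome_entropy(posterior, tests[t]))
```

In the small example below, test "a" has the most uncertain outcome because every theory predicts a coin
flip, so it is selected even though its response cannot distinguish the theories. This is one way to see
why maximizing outcome entropy alone can be a poor criterion when responses are noisy.

```python
posterior = {"A": 0.5, "B": 0.5}
tests = {"a": {"A": 0.5, "B": 0.5}, "b": {"A": 0.95, "B": 0.2}}
chosen = uncertainty_sampling(posterior, tests)  # selects "a"
```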

6 Conclusions 

In this paper, we considered the problem of adaptively selecting which noisy tests to perform in order
to identify an unknown hypothesis sampled from a known prior distribution. We studied the Equivalence
Class Determination problem as a means to reduce the case of noisy observations to the classic,
noiseless case. We introduced EC$^2$, an adaptive greedy algorithm that is guaranteed to choose the
same hypothesis as if it had observed the outcome of all tests, and incurs near-minimal expected cost
among all policies with this guarantee. This is in contrast to popular heuristics that are greedy w.r.t.
version space mass reduction, information gain or value of information, all of which we show can be
very far from optimal. EC$^2$ works by greedily optimizing an objective tailored to differentiate between
sets of observations that lead to different decisions. Our bounds rely on the fact that this objective
function is adaptive submodular. We also develop EffECXtive, a practical algorithm based on EC$^2$, that
can be applied to arbitrary probabilistic models in which efficient exact inference is possible. We apply
EffECXtive to a Bayesian experimental design problem, and our results indicate its effectiveness in
comparison to existing algorithms. We believe that our results provide an interesting direction towards
providing a theoretical foundation for practical active learning and experimental design problems.

A Additional Proofs

Lemma 5. The objective function $f$ of Eq. (3.1) is strongly adaptive monotone.

Proof. We must show that for all $\mathbf{x}_A$, $t \notin A$ and possible answer $x$ for test $t$ that

$$E_H[f(A, H) \mid \mathbf{x}_A] \leq E_H[f(A \cup \{t\}, H) \mid \mathbf{x}_A, X_t = x] \qquad \text{(A.1)}$$

Towards this end, it is useful to notice that for all $t \in \mathcal{T}$ the function $h \mapsto \mathcal{E}_t(h)$ depends only on
$X_t$. Hence for any $\mathbf{x}_A$, the function $h \mapsto f(A, h)$ is constant over realizations $\mathbf{x}_{\mathcal{T}} \succ \mathbf{x}_A$, so we
can define a function $g(\mathbf{x}_A)$ such that $g(\mathbf{x}_A) = E_H[f(A, H) \mid \mathbf{x}_A]$ by
$g(\mathbf{x}_A) := w\left(\bigcup_{t \in A} \mathcal{E}_t(x_t)\right)$, where $\mathbf{x}_A = (x_t)_{t \in A}$ and $\mathcal{E}_t(x)$ is the set of edges cut by $t$ if
$X_t = x$. Note that for all $\mathbf{x}_A \prec \mathbf{x}_B$ we have $g(\mathbf{x}_A) \leq g(\mathbf{x}_B)$, since the edge weights are nonnegative.
Setting $B = A \cup \{t\}$ yields Eq. (A.1) and hence implies strong adaptive monotonicity. □
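The monotonicity argument can be checked numerically on a toy noiseless instance. The sketch below
assumes deterministic test outcomes, edges between every pair of hypotheses in different classes with
weight $P(h) \cdot P(h')$, and that a test cuts an edge when at least one endpoint disagrees with the
realized hypothesis on that test. The function name and data layout are hypothetical.

```python
from itertools import combinations


def ec2_objective(tests_run, h_true, cls, prior, outcome):
    """Total weight of edges cut by the tests in tests_run when h_true is realized.

    cls[h]:        class index of hypothesis h
    prior[h]:      prior probability of hypothesis h
    outcome[t][h]: deterministic outcome of test t under hypothesis h
    """
    total = 0.0
    for h1, h2 in combinations(range(len(cls)), 2):
        if cls[h1] == cls[h2]:
            continue  # edges only join hypotheses in different classes
        cut = any(
            outcome[t][h1] != outcome[t][h_true] or outcome[t][h2] != outcome[t][h_true]
            for t in tests_run
        )
        if cut:
            total += prior[h1] * prior[h2]
    return total
```

On the instance used in the checks (four equally likely hypotheses in two classes), adding a test never
decreases the objective, as the lemma requires, because the set of cut edges can only grow.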

Lemma 6. The objective function $f$ of Eq. (3.1) is adaptive submodular for any prior with rational
values.

Proof. We first prove the result assuming a uniform prior $P(\cdot)$, and then show how to reduce the
general prior case to the uniform prior case. Hence all edges have weight $1/n^2$, where there are
$n$ hypotheses. For convenience, we also rescale our units of reward so that all edges have unit
weight. (Note that $f$ is adaptive submodular iff $cf$ is for any constant $c > 0$.) To prove adaptive
submodularity, we must show that for all $\mathbf{x}_A \prec \mathbf{x}_B$ and $t \in \mathcal{T}$, we have $\Delta(t \mid \mathbf{x}_B) \leq \Delta(t \mid \mathbf{x}_A)$.
Fix $t$ and $\mathbf{x}_A$, and let $V(\mathbf{x}_A) := \{h : P(h \mid \mathbf{x}_A) > 0\}$ denote the version space, if $\mathbf{x}_A$ encodes the
observed outcomes. Let $n_V := |V(\mathbf{x}_A)|$ be the number of hypotheses in the version space. Likewise,
let $n_{i,a}(\mathbf{x}_A) := |\{h : h \in V(\mathbf{x}_A, X_t = a) \cap \mathcal{H}_i\}|$, and let $n_a(\mathbf{x}_A) := \sum_{i=1}^{m} n_{i,a}(\mathbf{x}_A)$. Also, define
$e_a(\mathbf{x}_A) := \frac{1}{2} \sum_{i \neq j} \sum_{b \neq a} n_{i,b}(\mathbf{x}_A) \cdot n_{j,b}(\mathbf{x}_A)$ to be the number of edges cut such that at $t$ both
hypotheses agree with each other but disagree with the realized hypothesis $h^*$, conditioning on
$X_t = a$. We define a function $\phi$ of these quantities such that $\Delta(t \mid \mathbf{x}_A) = \phi(\mathbf{n}(\mathbf{x}_A), \mathbf{e}(\mathbf{x}_A))$, where
$\mathbf{n}(\mathbf{x}_A)$ is the vector consisting of $n_{i,a}(\mathbf{x}_A)$ for all $i$ and $a$, and $\mathbf{e}(\mathbf{x}_A)$ is the vector consisting of
$e_a(\mathbf{x}_A)$ for all $a$. For brevity, we suppress the dependence on $\mathbf{x}_A$ where it is unambiguous. Then, as
we will explain below, $\phi$ is defined as

$$\phi(\mathbf{n}, \mathbf{e}) := \frac{1}{2} \sum_{i \neq j} \sum_{a \neq b} n_{i,a} \cdot n_{j,b} + \sum_{a} e_a \left(1 - \frac{n_a}{n_V}\right) \qquad \text{(A.2)}$$

Here, $i$ and $j$ range over all class indices, and $a$ and $b$ range over all possible outcomes of test $t$. The
first term on the right-hand side counts the number of edges that will be cut by selecting test $t$ no
matter what the outcome of $t$ is. Such edges consist of hypotheses that disagree with each other
at $t$ and, as with all edges, lie in different classes. The second term counts the expected number
of edges cut by $t$ consisting of hypotheses that agree with each other at $t$. Such edges will be cut
by $t$ iff they disagree with $h^*$ at $t$. The edges $\{h, h'\}$ with $h, h' \in V(\mathbf{x}_A)$ and $P(X_t = a \mid h) =
P(X_t = a \mid h') = 1$ (of which there are $e_a$) will be cut by $t$ iff $X_t \neq a$. Since we assume a uniform
prior, $P[X_t \neq a \mid \mathbf{x}_A] = 1 - n_a/n_V$ for any partial realization with $t \notin A$, hence the expected
contribution of these edges to $\Delta(t \mid \mathbf{x}_A)$ is $e_a(1 - n_a/n_V)$, from whence we get the second term.

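Now fix $\mathbf{x}_B \succ \mathbf{x}_A$. Our strategy for proving $\Delta(t \mid \mathbf{x}_B) \leq \Delta(t \mid \mathbf{x}_A)$ is as follows. As more
observations are made, the version space can only shrink, i.e. $V(\mathbf{x}_B) \subseteq V(\mathbf{x}_A)$. This means that
for all $i$ and $a$, $n_{i,a}$ is nonincreasing, i.e., $n_{i,a}(\mathbf{x}_B) \leq n_{i,a}(\mathbf{x}_A)$. Note we may interpret $e_a$ as a
function of the variables in $\{n_{i,a} : 1 \leq i \leq m, a \in \mathcal{X}\}$, and that it is nondecreasing in each $n_{i,a}$, so
we may also deduce that $e_a(\mathbf{x}_B) \leq e_a(\mathbf{x}_A)$ for all $a$. Hence we consider a parameterized path $p(\tau)$
in $\mathbb{R}^{(m+1)|\mathcal{X}|}$ from $p(0) := (\mathbf{n}(\mathbf{x}_B), \mathbf{e}(\mathbf{x}_B))$ to $p(1) := (\mathbf{n}(\mathbf{x}_A), \mathbf{e}(\mathbf{x}_A))$. Then by integrating along
the path we obtain

$$\Delta(t \mid \mathbf{x}_A) - \Delta(t \mid \mathbf{x}_B) = \int_{\tau = 0}^{1} \frac{d(\phi \circ p)}{d\tau} \, d\tau. \qquad \text{(A.3)}$$

We require that at each point in $p$ it holds that $e_a = \frac{1}{2} \sum_{i \neq j} \sum_{b \neq a} n_{i,b} \cdot n_{j,b}$ for all $a$, and also ensure
that $p$ is nondecreasing in each coordinate. There exists a path meeting these requirements, since
$(\mathbf{n}(\mathbf{x}_B), \mathbf{e}(\mathbf{x}_B)) \leq (\mathbf{n}(\mathbf{x}_A), \mathbf{e}(\mathbf{x}_A))$ and each $e_a$ is nondecreasing in each variable. This implies
$dn_{i,a}/d\tau \geq 0$ and $de_a/d\tau \geq 0$ for all $i$ and $a$. Hence we can prove the integral is nonnegative by
applying the chain rule for the derivative to obtain

$$\frac{d(\phi \circ p)}{d\tau} = \sum_{i,a} \frac{\partial \phi}{\partial n_{i,a}} \frac{dn_{i,a}}{d\tau} + \sum_{a} \frac{\partial \phi}{\partial e_a} \frac{de_a}{d\tau}$$

and then proving that $\partial \phi / \partial n_{i,a} \geq 0$ and $\partial \phi / \partial e_a \geq 0$ for all $i$ and $a$. Next, observe that
$\partial \phi / \partial e_a = (1 - n_a/n_V) \geq 0$. So fix a class index $k$ and an outcome $c$ and consider $\partial \phi / \partial n_{k,c}$.
Elementary calculus tells us that

$$\frac{\partial \phi}{\partial n_{k,c}} = \sum_{j \neq k,\, b \neq c} n_{j,b} - \frac{e_c}{n_V} + \sum_{b} \frac{e_b n_b}{n_V^2} \qquad \text{(A.4)}$$

This quantity is nonnegative iff

$$n_V e_c \leq n_V^2 \cdot \sum_{j \neq k,\, b \neq c} n_{j,b} + \sum_{b} e_b n_b \qquad \text{(A.5)}$$

Now substitute $\frac{1}{2} \sum_{i \neq j} \sum_{b \neq c} n_{i,b} \cdot n_{j,b}$ for $e_c$ to obtain

$$n_V e_c = \frac{n_V}{2} \sum_{i \neq j} \sum_{b \neq c} n_{i,b} \cdot n_{j,b} \leq n_V \cdot \sum_{j \neq k,\, b \neq c} n_{j,b} \left( \sum_{i} n_{i,b} \right) \leq n_V^2 \cdot \sum_{j \neq k,\, b \neq c} n_{j,b} \qquad \text{(A.6)}$$

Since $\sum_b e_b n_b \geq 0$, we obtain Eq. (A.5) from Eq. (A.6) by inspection, and hence $\partial \phi / \partial n_{k,c} \geq 0$ for
all $k$ and $c$. This completes the proof of the adaptive submodularity of $f$ under a uniform prior.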

We now show how to reduce the general prior case to the uniform prior case. Fix any prior $P$ with
rational probabilities, i.e. $P(h) \in \mathbb{Q}$ for all $h$. Then there exists $d \in \mathbb{N}$ and a function $k : \mathcal{H} \to \mathbb{N}$
such that $P(h) = k(h)/d$. Create a new instance containing $d$ hypotheses, where for each
$h \in \mathcal{H}$ there are $k(h)$ copies of $h$, denoted by $h^1, \ldots, h^{k(h)}$. Each copy of $h$ induces the same
conditional distribution of test outcomes $P(X_1, \ldots, X_N \mid h)$. All copies of $h$ belong to the same
class, and copies of $h$ and $h'$ belong to the same class iff $h$ and $h'$ do. Finally, assign a uniform
prior to this new instance. Then the adaptive submodularity of $f$ on this new instance implies the
adaptive submodularity on the original instance, if the weight of edge $\{h, h'\}$ in the original instance
is proportional to the number of edges between the copies of $h$ and the copies of $h'$ in the new
instance. That is, it suffices to set $w(\{h, h'\}) \propto k(h) \cdot k(h')$, and our choice of weight function,
$w(\{h, h'\}) := P(h) \cdot P(h')$, satisfies this condition. □
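The reduction from a rational prior to a uniform one is easy to carry out explicitly. The following sketch
computes a common denominator $d$ and the copy counts $k(h)$ using exact rational arithmetic. The
function name is hypothetical, and taking $d$ to be the least common multiple of the denominators is one
convenient choice among many, since any common multiple works.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def copies_for_uniform(prior):
    """Return (d, k) with P(h) = k[h] / d for a prior given as exact Fractions.

    prior: dict hypothesis -> Fraction, summing to 1.
    """
    # Least common multiple of all denominators.
    d = reduce(lambda a, b: a * b // gcd(a, b), (p.denominator for p in prior.values()), 1)
    k = {h: int(p * d) for h, p in prior.items()}
    return d, k
```

For the prior $(1/2, 1/3, 1/6)$ this yields $d = 6$ and copy counts $(3, 2, 1)$. The number of edges between
the copies of two hypotheses is then $k(h) \cdot k(h') = d^2 \cdot P(h) \cdot P(h')$, which is proportional to the edge
weight used in the original instance.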

B A Bad Example for the Info-Gain and Value of Information Criteria



A popular heuristic for the Optimal Decision Tree problem is to adaptively and greedily select the test
that maximizes the information gain in the distribution over hypotheses, conditioned on all previous
test outcomes. The same heuristic can be applied to the Equivalence Class Determination problem, in
which we compute the information gain with respect to the entropy of the distribution over classes
rather than hypotheses. Let $\pi_{IG}$ denote the resulting policy for Equivalence Class Determination.
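The class-level information gain criterion can be sketched as follows for the noiseless case, where each
hypothesis determines the outcome of a test. The function names and the dictionary-based data layout are
assumptions made for the illustration.

```python
import math
from collections import defaultdict


def class_entropy(weights, cls):
    """Entropy (in bits) of the class distribution induced by hypothesis weights.

    weights: dict hypothesis -> nonnegative weight (need not be normalized)
    cls:     dict hypothesis -> class label
    """
    mass = defaultdict(float)
    for h, w in weights.items():
        mass[cls[h]] += w
    z = sum(mass.values())
    return -sum((m / z) * math.log2(m / z) for m in mass.values() if m > 0)


def class_info_gain(posterior, cls, outcome_of):
    """Expected reduction in class entropy from running one test.

    posterior:  dict hypothesis -> normalized posterior probability
    outcome_of: dict hypothesis -> deterministic outcome of the test
    """
    by_outcome = defaultdict(dict)
    for h, p in posterior.items():
        if p > 0:
            by_outcome[outcome_of[h]][h] = p
    # sum(ws.values()) is the probability of the corresponding outcome.
    expected = sum(sum(ws.values()) * class_entropy(ws, cls) for ws in by_outcome.values())
    return class_entropy(posterior, cls) - expected
```

With four equally likely hypotheses in two classes, a test that separates the classes has a gain of one
bit, whereas a test that splits every class evenly has a gain of zero. Tests of the latter kind are what
the construction later in this section exploits.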

Another common heuristic for Optimal Decision Tree is to adaptively and greedily select the test
maximizing the Bayesian decision-theoretic value of information (VoI) criterion. Recall the value of
information of a test $t$ is the expected reduction in the expected risk of the minimum risk decision,
where the risk is the expected loss. Formally, consider the Bayesian decision-theoretic setup described
in §4. The VoI criterion myopically selects the test that maximizes

$$\Delta_{VoI}(t \mid \mathbf{x}_A) := \min_{d} E_H[\ell(d, H) \mid \mathbf{x}_A] - E_{x_t \sim X_t \mid \mathbf{x}_A}\left[ \min_{d} E_H[\ell(d, H) \mid \mathbf{x}_A, x_t] \right].$$

This heuristic can also be applied to the Equivalence Class Determination problem, by taking the
decision set $\mathcal{D}$ to be the set of equivalence classes, and the loss function to be the 0-1 classification
loss function, i.e., $\ell(d, H) = \mathbf{1}[H \notin d]$. Let $\pi_{VoI}$ denote the resulting policy.
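Under the 0-1 classification loss, the minimum expected risk is one minus the posterior mass of the most
likely class, so the VoI criterion is simple to evaluate. The sketch below does this for deterministic test
outcomes. The names and data layout are hypothetical.

```python
from collections import defaultdict


def min_risk(weights, cls):
    """min_d E[1[H not in d]] = 1 - (largest class mass) / (total mass)."""
    mass = defaultdict(float)
    for h, w in weights.items():
        mass[cls[h]] += w
    return 1.0 - max(mass.values()) / sum(mass.values())


def value_of_information(posterior, cls, outcome_of):
    """Expected reduction in minimum 0-1 risk from running one test.

    posterior:  dict hypothesis -> normalized posterior probability
    cls:        dict hypothesis -> class label
    outcome_of: dict hypothesis -> deterministic outcome of the test
    """
    by_outcome = defaultdict(dict)
    for h, p in posterior.items():
        if p > 0:
            by_outcome[outcome_of[h]][h] = p
    # Weight each branch by the probability of its outcome.
    expected = sum(sum(ws.values()) * min_risk(ws, cls) for ws in by_outcome.values())
    return min_risk(posterior, cls) - expected
```

As with information gain, a test that splits every class evenly leaves the class posterior unchanged and
therefore has zero value of information under this criterion.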

In this section we present a family of Equivalence Class Determination instances for which both
$\pi_{IG}$ and $\pi_{VoI}$ perform significantly worse than the optimal policy.

Theorem 7. There exists a family of Equivalence Class Determination instances with uniform priors
such that $c(\pi_{IG}) = \Omega(n/\log(n))\, c(\pi^*)$ and $c(\pi_{VoI}) = \Omega(n/\log(n))\, c(\pi^*)$, where $n$ is the number
of hypotheses and $\pi^*$ is an optimal policy.

In fact, we will prove a lower bound for each policy within a large family of adaptive greedy policies
which contains $\pi_{IG}$ and $\pi_{VoI}$, which we call posterior-based. Informally, this family consists of all
greedy policies that use only information about the posterior equivalence class distribution to select
the next test. More precisely, these policies define a potential function $\Phi$ which maps distributions of
distributions over equivalence classes to real numbers, and at each time step select the test $t$ which
maximizes $\Phi$ of the posterior distribution (over test outcomes $x_t$) of the posterior distribution over
equivalence classes generated by adding $x_t$ to the previously seen test outcomes. In the event of a tie,
we select any test maximizing this quantity at random. The information gain policy is posterior-based;
$\Phi$ is simply $-1$ times the expected entropy of the posterior equivalence-class distribution. Likewise,
the value of information policy is also posterior-based; $\Phi$ is simply $-1$ times the expected loss of the
best action for the posterior equivalence-class distribution. Hence to prove Theorem 7 it suffices to
prove the following more general theorem.

Theorem 8. There exists a family of Equivalence Class Determination instances with uniform priors
such that $c(\pi) = \Omega(n/\log(n))\, c(\pi^*)$ for any posterior-based policy $\pi$, where $n$ is the number of
hypotheses and $\pi^*$ is an optimal policy.



Proof. Fix integer parameter $q \geq 1$. There are $m = 2^q$ classes $\mathcal{H}_a$ for each $1 \leq a \leq 2^q$. Each
$\mathcal{H}_a$ consists of two hypotheses, $h_{a,0}$ and $h_{a,1}$. We call $a$ the index of $\mathcal{H}_a$. The prior is uniform
over the hypotheses $\mathcal{H} = \{h_{a,v} : 1 \leq a \leq m, 0 \leq v \leq 1\}$. There are four types of tests, all with
binary outcomes and all of unit cost. There is only one test of the first type, $t_0$, which tells us
the value of $v$ in the realized hypothesis $h^*$. Hence for all $a$, $H = h_{a,v} \Rightarrow X_{t_0} = v$. Tests of
the second type are designed to help us quickly discover the index of the realized class via binary
search if we have already run $t_0$, but to offer no information gain whatsoever if $t_0$ has not yet been
run. There is one such test $t_k$ for all $k$ with $1 \leq k \leq q$. For $z \in \mathbb{N}$, let $\phi_k(z)$ denote the $k$-th
least-significant bit of the binary encoding of $z$, so that $z = \sum_{k=1}^{\infty} 2^{k-1} \phi_k(z)$. Then for each $h_{a,v}$
we have $H = h_{a,v} \Rightarrow X_{t_k} = \mathbf{1}[\phi_k(a) = v]$. Tests of the third type are designed to allow us to do a
(comparatively slow) sequential search on the index of the realized class. Specifically, we have tests
$t_k^{seq}$ for all $1 \leq k \leq m$, such that $H = h_{a,v} \Rightarrow X_{t_k^{seq}} = \mathbf{1}[a = k]$. Finally, tests of the fourth type,
$\{t_k^{dumb} : k \in \mathbb{N}\}$, are dummy tests that reveal no information at all. Formally, $X_{t_k^{dumb}}$ always equals
zero.
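The construction can be written down directly. In the sketch below a hypothesis $h_{a,v}$ is the pair
`(a, v)` and a test is a tuple `(kind, k)`. This encoding and the function names are hypothetical choices
made for the example. The last function computes the class posterior after a single test from the uniform
prior, assuming the observed outcome is possible.

```python
def bit(k, z):
    """phi_k(z): the k-th least-significant bit of z, for k >= 1."""
    return (z >> (k - 1)) & 1


def hypotheses(q):
    """All hypotheses h_{a,v} with 1 <= a <= 2^q and v in {0, 1}."""
    return [(a, v) for a in range(1, 2 ** q + 1) for v in (0, 1)]


def outcome(test, a, v):
    """Outcome of a test under hypothesis h_{a,v}.

    Tests: ("t", 0) is t_0, ("t", k) is t_k, ("seq", k) and ("dumb", k).
    """
    kind, k = test
    if kind == "t" and k == 0:
        return v
    if kind == "t":
        return int(bit(k, a) == v)
    if kind == "seq":
        return int(a == k)
    return 0  # dummy tests always return zero


def class_distribution_after(test, x, q):
    """Posterior over class indices after observing outcome x for one test."""
    consistent = [(a, v) for a, v in hypotheses(q) if outcome(test, a, v) == x]
    counts = {}
    for a, _ in consistent:
        counts[a] = counts.get(a, 0) + 1
    return {a: c / len(consistent) for a, c in counts.items()}
```

Running any single $t_k$ with $1 \leq k \leq q$ before $t_0$ removes exactly one hypothesis from every class, so
the class posterior stays uniform over all $m$ classes. A sequential test with outcome 0 removes one whole
class instead.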






Given this input, suppose $H = h_{a,v}$. One solution is to run $t_0$ to find $v$, then run tests $t_1, \ldots, t_q$ to
determine $\phi_k(a)$ for all $1 \leq k \leq q$ and hence to determine $a$. This reveals the value of $H$, and hence
the class $H$ belongs to. Since the tests have unit cost, this policy $\pi'$ has cost $c(\pi') = q + 1$.
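The binary search policy $\pi'$ can be simulated in a few lines. The sketch below takes the realized
hypothesis $h_{a,v}$ as input, generates the test outcomes from the definitions above, and decodes the class
index from them. The function name is hypothetical. Because indices range over $1, \ldots, 2^q$, the $q$
observed bits being all zero is decoded as $a = 2^q$.

```python
def run_binary_search_policy(a, v, q):
    """Simulate pi' on h_{a,v}; return (decoded class index, number of tests run)."""
    tests_run = 0

    v_obs = v  # outcome of t_0 reveals v
    tests_run += 1

    low_bits = 0
    for k in range(1, q + 1):
        # Outcome of t_k is 1[phi_k(a) = v].
        x = int(((a >> (k - 1)) & 1) == v)
        tests_run += 1
        # Decode phi_k(a): it equals x when v = 1 and 1 - x when v = 0.
        b = x if v_obs == 1 else 1 - x
        low_bits |= b << (k - 1)

    # Only a = 2^q has all of its q low-order bits equal to zero.
    a_hat = low_bits if low_bits != 0 else 2 ** q
    return a_hat, tests_run
```

For every hypothesis the decoded index matches the true one after exactly $q + 1$ unit-cost tests.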

Next, fix a posterior-based policy $\pi$ and consider what it will do. Call a class possible if not all of
its hypotheses have been ruled out by tests performed so far. Note that all possible classes contain
the same number of hypotheses, because they initially have two, and each test $t_k$ that can reduce the
size of a possible class to one will reduce the size of every possible class to one. This, and the fact
that the prior is uniform, implies that the posterior equivalence-class distribution is uniform over the
remaining possible classes. If no tests in $\{t_k : 0 \leq k \leq q\}$ have been run, as is initially the case, any
single test in this set will not change the posterior equivalence-class distribution. Hence, as measured
with respect to $\Phi$, such tests are precisely as good as the dummy tests. If these tests are each better
than any test in $\{t_k^{seq} : 1 \leq k \leq m\}$, then $\pi$ selects among $\{t_k : 0 \leq k \leq q\} \cup \{t_k^{dumb} : k \in \mathbb{N}\}$ at
random. Since there are infinitely many dummy tests, with probability one a dummy test is selected.
Since the posterior remains the same, $\pi$ will repeatedly select a test at random from this set, resulting
in an infinite loop as dummy tests are selected repeatedly ad infinitum. Otherwise, some test $t_k^{seq}$ is
preferable to the other tests, measured with respect to $\Phi$. In the likely event that $k$ is not the index of
$H$, we are left with a residual problem in which tests in $\{t_k : 0 \leq k \leq q\}$ still have no effect on the
posterior, there is one less class, and the prior is again uniform. Hence our previous argument still
applies, and $\pi$ will either enter an infinite loop or will repeatedly select tests in $\{t_k^{seq} : 1 \leq k \leq m\}$
until a test has an outcome of 1. Thus in expectation $\pi$ costs at least $c(\pi) \geq \frac{1}{m} \sum_{z=1}^{m} z = (m+1)/2$.
Since $m = 2^q$, $n = 2m$, and $c(\pi^*) \leq c(\pi') = q + 1 = \log_2(n)$ we infer

$$c(\pi) \geq \frac{m+1}{2} > \frac{n}{4} \geq \frac{n}{4 \log_2(n)} \, c(\pi^*),$$

which completes the proof.



□ 
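The gap established in the proof can be tabulated numerically. The sketch below computes the expected
cost $\frac{1}{m} \sum_{z=1}^{m} z$ of the sequential search with exact rational arithmetic and divides it by the
cost $q + 1$ of the binary search policy. The function names are hypothetical.

```python
from fractions import Fraction


def sequential_search_cost(m):
    """Expected cost (1/m) * sum_{z=1}^{m} z of probing classes one at a time."""
    return Fraction(sum(range(1, m + 1)), m)


def cost_ratio(q):
    """Ratio of the sequential search cost to the cost q + 1 of binary search."""
    m = 2 ** q
    return sequential_search_cost(m) / (q + 1)
```

For $q = 3$ (so $m = 8$ and $n = 16$) the sequential search costs $9/2$ tests in expectation against $4$ for
the binary search, and the ratio exceeds $n / (4 \log_2 n)$ for every $q$, in line with the bound above.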